ABSTRACT

A Monte Carlo study was conducted to evaluate six models commonly used
to evaluate change. The results revealed specific problems with each.
Analysis of covariance and analysis of variance of residualized gain
scores appeared to substantially and consistently overestimate the
change effects. Multiple factor analysis of variance models utilizing
pretest and post-test scores yielded invalidly low F ratios. The
analysis of variance of difference scores and the multiple factor
analysis of variance using repeated measures were the only models
which adequately controlled for pre-treatment differences; however,
they appeared to be robust only when the error level is 50% or more.
This casts serious doubt on published findings and theories based upon
change score analyses. When an investigator is collecting data which
have an error level less than 50% (which is true in most situations),
a change score analysis is entirely inadvisable until an alternative
procedure is developed.



A MONTE CARLO STUDY OF SIX MODELS OF CHANGE 
Charles R. Corder-Bolz 

The desire to observe and understand the forces that cause change
is fundamental to educational and social scientists. Change phenomena
include such intriguing aspects of life as the acquisition of
knowledge, the reduction of anxiety, positive changes in self-concept,
and the increase of productivity in human interactions. These
phenomena are most validly viewed within the context of change.
Therefore, the concept of change is basic to the educational and
social science researcher. The measurement of various constructs and
their change has reached a high degree of sophistication. The very
reliability and validity of such measurements can be estimated. The
scientist can choose from a wide array of measurement instruments that
include questionnaires, interview techniques, and observation
procedures. The critical issue, however, is how the scientist can
evaluate the observed changes and choose from among various
contrasting hypotheses regarding the nature of the change phenomena.

There are two broad categories of methodologies for the study of
change. The first category includes the various approaches based upon
experimental design considerations. Characteristically, experimental
design approaches utilize two or more parallel groups which receive
different treatments. The analysis of variance model can then be used
to analyze the post-treatment scores. The intent is to assess change
through the observation of differences between groups caused by the
various treatments administered to the different groups. Random
assignment to the groups should result in independent and equivalent
samples. Unfortunately, true randomization is difficult in the
"real" world and, thus, there are often important differences between
the groups prior to the administration of the treatments. These
pre-treatment differences sometimes have profound influences on
post-treatment group observations. Consequently, researchers have a
serious desire to control initial or potential initial differences
between treatment groups. This desire leads them to the second
category of approaches, based upon mathematical methods designed to
eliminate pre-treatment differences between groups when evaluating
changes in those groups.

A number of statistical models and computational procedures are
included in this second category. Probably the most commonly used
statistical approach to the control of pre-treatment differences is
that of difference scores or simple change scores. This approach
involves the subtraction of a pre-treatment observation from each
post-treatment observation. Thus, a subject's change score is derived
by subtracting his pretest score from his post-test score. The
result is theoretically the change caused by the treatment. If the
measurement tool used has 100% reliability, then this difference
score should be a valid measurement of the change. The concern arises
from the fact that most instruments used to measure educational or
behavioral change are plagued by a degree of unreliability.
Unreliability is, in effect, error in the measurement process. It is
assumed that this error is independent in each given observation.
Therefore, each observed score is a function of what can be called a
true score plus the measurement error associated with the
observation. If people are tested, whether pretested and/or
post-tested, then each test score is composed of both a true score and
the independent error component. If the people do not change between
the two testings, then the two true scores for each should equal each
other. In this case, obtained difference scores would contain no
true score, but rather be composed entirely of error. Likewise,
theoretically, in situations where there is change, a difference
score would contain the difference between the true scores, or
true-score delta, plus the error associated with the first
measurement and plus the error associated with the second measurement.
Though a measuring instrument may yield data with an acceptable level
of error, the difference scores resulting from two uses of the measure
could contain a very high level of error. For example, if a
questionnaire had a reliability of .90, then approximately 80% of the
variance of the scores would be so-called true score variance and
approximately 20% of the variance would be error. If a treatment
increased the true scores by 10% or accounted for 10% additional
variance in the true scores in the post-test, then the true-score
delta component of the difference score should reflect the 10%
variance in the post-test true score that is independent of the
pretest true score variance. However, the errors associated with the
second and the first measurements are independent of each other and
independent of the true scores. Therefore, the error component of the
difference score would include, or could include, the 20% error
variance from the first measurement plus the 20% error variance from
the second measurement. The error level of the difference score would
likely be the 20% associated with the measuring instrument and could
be as high as the sum of the error levels associated with both of the
two measurements. Theoretically, such a difference score would have a
signal-to-noise ratio of 1:2, and the ratio could be as poor as 1:4.
This is in contrast to the signal-to-noise ratio of 8:2 normally
associated with the questionnaire. Therefore, from a measurement
theory perspective, the use of difference scores or simple change
scores is very questionable.
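
The accumulation of error in a difference score can be illustrated
with a small simulation. The following is only a sketch under stated
assumptions: the 80%/20% split of true and error variance comes from
the example above, while the function name, the sample size, and the
seed are hypothetical choices made for illustration. The subjects are
given no true change, so the difference scores consist of error alone.

```python
import math
import random
import statistics

def simulate_difference_scores(error_share=0.20, n=20000, seed=1):
    """Post-minus-pre scores for subjects whose true scores do not change.

    error_share is the proportion of observed-score variance that is
    error (0.20 in the questionnaire example). n and seed are
    illustrative assumptions, not values taken from the study.
    """
    rng = random.Random(seed)
    true_sd = math.sqrt(1.0 - error_share)
    error_sd = math.sqrt(error_share)
    diffs = []
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)          # identical at both testings
        pre = true_score + rng.gauss(0.0, error_sd)   # independent pretest error
        post = true_score + rng.gauss(0.0, error_sd)  # independent post-test error
        diffs.append(post - pre)
    return diffs

diffs = simulate_difference_scores()
diff_variance = statistics.pvariance(diffs)
# The two independent error variances add, so this is close to 0.20 + 0.20.
print(round(diff_variance, 2))
```

Each observed score has unit variance, yet the difference score carries
roughly twice the error variance of a single testing and none of the
true score variance.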

DuBois (1957), Lord (1956; 1963), and McNemar (1958) have
recommended the use of "residual gain" scores as a preferable
substitute to the use of "raw gain" scores. In this procedure, a gain
is expressed as the deviation of a post-test score from the
post-test/pretest regression line. Thus, the part of the post-test
information that is linearly predictable from the pretest can be
partialed out. The residual, or the residualized gain, is then used
to evaluate the change by eliminating any pretest differences or
biases in the difference scores. A concern with the use of
residualized gain scores is the consequence of partialing out the
pretest information. The information that the post-test and pretest
scores have in common is what can be considered as the true score
component of the pretest score. This true score component of the
pretest score has predictive value. The error component of the
pretest score has no predictive value and, therefore, has no function
in the regression procedure. Ostensibly, when the pretest is used as a
predictor, the effect is to remove the true score information from the
post-test scores. The concern is that the residual, or that which the
pretest cannot predict, is, in effect, the error in measurement plus
the possible gain in true scores. Consequently, as with difference
scores, residualized gain scores run the risk of being primarily
composed of error.
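
The residualized gain computation described above can be sketched as
follows. This is a minimal illustration, not the procedure used in the
study: the function name and the six pairs of scores are hypothetical,
and a single ordinary least squares line fitted across the whole sample
is assumed.

```python
import statistics

def residualized_gains(pre, post):
    """Deviation of each post-test score from the post-on-pre regression line."""
    mean_pre = statistics.mean(pre)
    mean_post = statistics.mean(post)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    slope = sxy / sxx
    intercept = mean_post - slope * mean_pre
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]

# Hypothetical scores, for illustration only.
pre = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]
post = [12.0, 13.0, 11.0, 18.0, 12.0, 17.0]
gains = residualized_gains(pre, post)
print([round(g, 3) for g in gains])
```

By construction the residuals sum to zero and are uncorrelated with the
pretest, which is the sense in which the pretest information has been
partialed out.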

A fourth approach to the analysis of change is the multiple
factor analysis of variance model. The treatment conditions are
represented as a dimension in the analysis and the pretest versus
post-test scores are two levels in an additional dimension. The
effect is to partition or separate the various sources of variance
such as treatment effects and pretest effects. With this model, the
investigator is able to isolate and evaluate possible pretest
differences among the subjects as well as possible treatment
differences between the subjects. If there is a difference or change
due to one or more of the treatments, this will result in a greater
pretest to post-test difference for one of the treatments in
comparison to the other treatments. This effect will be reflected in
the interaction component of the analysis. Specifically, the change is
evaluated by the F ratio of the mean square interaction over the mean
square error. However, the analysis of variance model assumes an
independence among all observations. In the present situation,
pretest and post-test measurements cannot be assumed to be entirely
independent. Thus, this particular model is rarely used and is
included herein mainly for comparison purposes.

A fifth approach to the analysis of change involves a refinement
of the analysis of variance model which accommodates multiple
measurements derived from the subjects. This procedure, which is
commonly referred to as a repeated measures analysis of variance, is
computationally similar to the above-mentioned method of multiple
factor analysis of variance. However, there are additional sums of
squares and mean squares which reflect the effects of between-subject
differences and the interaction between the subjects and treatments.
As with the multiple factor analysis of variance, change is evaluated
by the pre-post test and treatment interaction term. However, despite
the distinctions made in the theoretical foundations, the F ratio for
this interaction should have exactly the same value as the F ratio
generated by a one-way analysis of variance utilizing difference
scores (Jennings, 1972).
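
The equivalence of the two F ratios can be checked numerically. The
sketch below is an illustration under stated assumptions, not the
program used in the study: the function names and the eight pairs of
scores are hypothetical, two occasions and equal group sizes are
assumed, and the repeated measures error term is taken to be the
subject-by-occasion interaction within groups.

```python
def mean(values):
    return sum(values) / len(values)

def f_difference_scores(groups):
    """One-way ANOVA F on post-minus-pre scores.

    groups is a list of groups; each group is a list of (pre, post) pairs.
    """
    diffs = [[post - pre for pre, post in g] for g in groups]
    grand = mean([d for g in diffs for d in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in diffs)
    ss_within = sum((d - mean(g)) ** 2 for g in diffs for d in g)
    df_between = len(diffs) - 1
    df_within = sum(len(g) for g in diffs) - len(diffs)
    return (ss_between / df_between) / (ss_within / df_within)

def f_repeated_measures_interaction(groups):
    """Group-by-occasion interaction F for a two-occasion mixed design."""
    grand = mean([s for g in groups for pair in g for s in pair])
    occasion_means = [mean([pair[k] for g in groups for pair in g]) for k in (0, 1)]
    ss_inter = 0.0
    ss_error = 0.0
    for g in groups:
        group_mean = mean([s for pair in g for s in pair])
        cell_means = [mean([pair[k] for pair in g]) for k in (0, 1)]
        for k in (0, 1):
            ss_inter += len(g) * (cell_means[k] - group_mean - occasion_means[k] + grand) ** 2
        for pair in g:
            subject_mean = mean(pair)
            for k in (0, 1):
                ss_error += (pair[k] - subject_mean - cell_means[k] + group_mean) ** 2
    df_inter = (len(groups) - 1) * (2 - 1)
    df_error = sum(len(g) - 1 for g in groups) * (2 - 1)
    return (ss_inter / df_inter) / (ss_error / df_error)

# Hypothetical (pretest, post-test) pairs, for illustration only.
control = [(10.0, 11.0), (12.0, 12.5), (9.0, 10.5), (14.0, 13.0)]
treated = [(11.0, 14.0), (13.0, 15.5), (8.0, 12.0), (15.0, 17.0)]
f_diff = f_difference_scores([control, treated])
f_rm = f_repeated_measures_interaction([control, treated])
print(f_diff, f_rm)
```

With two occasions, both the interaction sum of squares and its error
term are exactly half of the corresponding sums of squares of the
difference scores, so the halves cancel in the ratio.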

A sixth approach to the analysis of change is the analysis of
covariance model. The pretest is used as a covariate in an attempt to
control for pretest differences between subjects. The variance of the
post-test scores that is linearly predictable from the pretest scores
is partialed out. The model is similar to the analysis of variance of
residualized gain scores except that the former is based upon
within-treatment-group regression whereas the latter is based upon a
regression across the entire sample. One of the problems with this
approach is that the traditional covariance model assumes an
independence of measurement of the covariate and the dependent
variable. More specifically, there is a necessary assumption of the
independence of the error associated with each of the two
measurements. Clearly, the use of pretest scores as a covariate to
analyze post-test scores violates this assumption. Furthermore, the
analysis of covariance model also theoretically suffers the problem of
high error levels. When the pretest is used as a covariate, the result
is the removal of the true score information from the post-test scores
that is also contained in the pretest scores. The residual is the
error of measurement plus any change in the true score values. The
resultant information could contain a disproportionate amount of
error.
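
The covariance adjustment can be sketched as a comparison of two
residual sums of squares: one from a single regression across the
entire sample, and one from the pooled within-group regression. The
code below is a textbook-style illustration under stated assumptions,
not the program used in the study; the function names and the scores
are hypothetical, and a common within-group slope is assumed.

```python
def mean(values):
    return sum(values) / len(values)

def sums_of_products(x, y):
    """Return (Sxx, Sxy, Syy) computed about the means of x and y."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy

def ancova_f(groups):
    """F ratio for the group effect on the post-test, adjusted for the pretest.

    groups is a list of (pre_scores, post_scores) tuples, one per group.
    """
    all_pre = [x for pre, _ in groups for x in pre]
    all_post = [y for _, post in groups for y in post]
    # Residual SS from one regression line across the entire sample.
    sxx_t, sxy_t, syy_t = sums_of_products(all_pre, all_post)
    ss_res_total = syy_t - sxy_t ** 2 / sxx_t
    # Residual SS from the pooled within-group regression.
    sxx_w = sxy_w = syy_w = 0.0
    for pre, post in groups:
        sxx, sxy, syy = sums_of_products(pre, post)
        sxx_w += sxx
        sxy_w += sxy
        syy_w += syy
    ss_res_within = syy_w - sxy_w ** 2 / sxx_w
    df_effect = len(groups) - 1
    df_error = len(all_pre) - len(groups) - 1
    return ((ss_res_total - ss_res_within) / df_effect) / (ss_res_within / df_error)

# Hypothetical scores, for illustration only.
pre_c, post_c = [10.0, 12.0, 9.0, 14.0, 11.0], [11.0, 12.5, 10.5, 13.0, 12.0]
pre_t, post_t = [11.0, 13.0, 8.0, 15.0, 10.0], [14.0, 15.5, 12.0, 17.0, 13.5]
f_treated = ancova_f([(pre_c, post_c), (pre_t, post_t)])
f_identical = ancova_f([(pre_c, post_c), (pre_c, post_c)])
print(f_treated, f_identical)
```

When both groups contain identical data, the two residual sums of
squares coincide and the adjusted F ratio is zero, which is a
convenient sanity check on the computation.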

The issues of the evaluation of change remain unresolved because
the various theoretical positions approach the problem from different
assumptions, and therefore have no common ground from which a common
assessment can be made. A particularly important difference in
perspectives is the concern over the proportion of error in change
scores. The research community was stunned, if not confused, by
Overall and Woodward's (1975) demonstration that the power of tests of
significance is maximum when the reliability of the difference scores
is zero. The best advice to date had been not to measure change at all
(Cronbach & Furby, 1970).

In situations in which the uncertainties cannot be resolved in a
theoretical manner, insight can often be gained from a Monte Carlo
study. In this kind of study, artificial data are generated such that
they conform to a desired structure. Various data sets are generated
to reflect the various differences of data parameters that are of
concern. Then one or more data analysis models are used to analyze
the data, to determine the extent to which the models give valid
results. In the case of the evaluation of change, insight might be
gained by generating sets of data with known characteristics, such as
error level, treatment effects, and pretest differences, then applying
the various approaches to the data sets. The results would provide a
basis for a direct comparison of the models.

Method



The basic method was to simulate the traditional treatment
versus control group experiment in which each subject is pretested and
post-tested. The two groups were composed of randomly assigned
subjects, with an arbitrary number of 20 subjects per group. One
group represented the treatment group which received some kind of
experimental treatment and the other group represented the
traditional control group which either received no treatment or a
neutral treatment. Each hypothetical subject was measured on the
particular dependent variable before the administration of the
treatment and was again measured on the same variable after the
administration of the treatment. Each pretest observation $Y_{ij1}$
can be represented as a function of $T_{ij}$, a true score, plus an
error term, $e_{ij1}$, such that the expected value of any $Y_{ij1}$
is equal to the true score $T_{ij}$. Each post-test observation
$Y_{ij2}$ can be represented as a function of $T_{ij}$, plus the
treatment effect, $\alpha_j$, plus $e_{ij2}$, such that the expected
value of $Y_{ij2}$ is equal to $T_{ij} + \alpha_j$. $T_{ij}$
represents the true score associated with the particular observation,
$\alpha_j$ represents the change in the true score associated with a
particular treatment, and $e_{ijk}$ represents the error associated
with the particular observation of the particular subject.

The general design was to generate random normal populations
that conformed to specific parameter values. These populations
consisted of 6,500 observations each. Then, for each simulated
experiment, 20 subjects or observations were randomly selected from
each population.

Three basic parameters were explored. Several data sets were
generated such that there were differences in the amount of change
caused by the treatments. Varying proportions of error variance were
incorporated in the pre- and post-test scores. Furthermore, different
amounts of pretest differences were represented in the data sets.

Three treatment levels were explored. In the first level, there
was no difference between the means of the parent populations. In the
second level, the population means differed such that the expected F
ratio of the difference of means of samples taken from each of the two
populations would equal 4.098, which would have an associated
probability of approximately 0.05. In the third level, the population
means differed from each other such that the expected F ratio of the
difference of means of samples taken from each of the two populations
would be 7.353, with an associated probability of approximately 0.01.
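
The correspondence between these two F ratios and their probabilities
can be checked directly. The sketch below assumes 1 and 38 degrees of
freedom, a figure inferred here from two groups of 20 subjects rather
than stated in the text. It relies on the identity that an F ratio with
1 and df degrees of freedom is the square of a t statistic with df
degrees of freedom, and it integrates the t density numerically with
Simpson's rule. The function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def f_tail_probability(f_value, df_error, steps=2000):
    """P(F > f_value) for an F ratio with 1 and df_error degrees of freedom."""
    upper = math.sqrt(f_value)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df_error) + t_pdf(upper, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    central_half = total * h / 3           # area between 0 and sqrt(f_value)
    return 1.0 - 2.0 * central_half

df_error = 2 * 20 - 2                      # assumption: two groups of 20 subjects
p_level_2 = f_tail_probability(4.098, df_error)
p_level_3 = f_tail_probability(7.353, df_error)
print(round(p_level_2, 3), round(p_level_3, 3))
```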



Six levels of error variance were explored. Samples were taken
from populations of scores which were composed from the following
levels of error variance: 0%, 10%, 25%, 35%, 50%, and 60%. The
variance of the scores that was called error variance was unrelated to
the variance of the scores that was regarded as true score variance.
The error component was normally distributed with variance
$\sigma_e^2$ and mean $\mu = 0$. The magnitude of $\sigma_e^2$ was
dependent upon the relative amount of error variance in the
particular population.

In order not to confound the magnitude of error variance with the
magnitude of observed score variance, the variance of observed scores
was maintained at a constant 1.0. The observed scores were a
linear combination of true scores and error components which were
multiplied by their respective weights $c_1$ and $c_2$. If $x_1$ is
from a population of true scores, and if $x_2$ is from a population of
error terms, then the proportion of error variance in observed scores
$Y$, which are a linear combination of $x_1$ and $x_2$, can be
determined by the weights of the linear combination.

If $Y = c_1 x_1 + c_2 x_2$, then $\sigma_Y^2 = c_1^2 \sigma_{x_1}^2 + c_2^2 \sigma_{x_2}^2$.

If $\sigma_{x_1}^2 = 1$ and $\sigma_{x_2}^2 = 1$, then $\sigma_Y^2 = c_1^2 + c_2^2$.

Simply, thus, if $c_1^2 + c_2^2 = 1$, then $\sigma_Y^2 = 1$.

From the above equations, if $x_1$ is the true score component and
$x_2$ is the error component in the observed score $Y$, it can be seen
that $c_1^2$ plus $c_2^2$ should equal 1.0 in all simulated
conditions. The proportion of error variance, $c_2^2$, was therefore
equal to the square of the linear weight for the error component.
Thus, the weights for the linear combination can be computed by taking
the square root of the respective percentages of true score variance
and error variance.
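
The weighting scheme can be sketched in a few lines. This is an
illustration under stated assumptions rather than the original
program: the six error levels are those listed in the text, while the
function names, the sample size, and the seed are hypothetical. The
true score and error components are drawn independently from a
standard normal distribution.

```python
import math
import random
import statistics

ERROR_SHARES = (0.0, 0.10, 0.25, 0.35, 0.50, 0.60)

def weights(error_share):
    """Return (c1, c2): the weights for the true score and error components."""
    return math.sqrt(1.0 - error_share), math.sqrt(error_share)

def observed_scores(error_share, n=20000, seed=7):
    """Observed scores Y = c1*x1 + c2*x2 with x1 and x2 independent N(0, 1)."""
    rng = random.Random(seed)
    c1, c2 = weights(error_share)
    return [c1 * rng.gauss(0.0, 1.0) + c2 * rng.gauss(0.0, 1.0) for _ in range(n)]

observed_variance = {
    share: statistics.pvariance(observed_scores(share)) for share in ERROR_SHARES
}
for share in ERROR_SHARES:
    c1, c2 = weights(share)
    print(f"error {share:.2f}: c1={c1:.3f} c2={c2:.3f} "
          f"var(Y)={observed_variance[share]:.3f}")
```

Because the squared weights always sum to one, the observed score
variance stays near 1.0 at every error level, so the error level is
not confounded with the overall spread of the scores.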

Three amounts of pretest differences, or initial between-group
differences, were also explored. While different models or techniques
may or may not be able to handle various levels of error, or may
introduce different kinds of distortions at different probability
levels, the ultimate interest is in how well each procedure is able to
evaluate change validly even though there may be initial
between-group differences. In the first level, there was no difference
between the means of the pretest populations being sampled. In the
second level, the population means differed such that the expected F
ratio of the difference of means of samples taken from each of the two
populations would be 4.098, with an associated probability of
approximately 0.05. In the third level, the population means differed
from each other such that the expected F ratio of the difference of
means of samples taken from each of the two populations would be
7.353, with an associated probability of approximately 0.01.

In summary, populations were generated and subsequently sampled
which met the following definitions:

Pretest control group: $Y_{i11} = c_1 T_{i11} + c_2 E_{i11}$

Pretest treatment group: $Y_{i21} = c_1 T_{i21} + c_2 E_{i21}$

[The remaining definitions, including the terms in $\alpha$ and
$\pi$, are not legible in the source.]

where $T$ is a true score from a standard normal population,
$T \sim N(0, 1)$, $E$ is the error component from a standard normal
population, $E \sim N(0, 1)$, $\alpha$ is the treatment effect, $\pi$
is the initial between-group difference, $c_1$ is the weight for the
true scores, and $c_2$ is the weight for the error components.

The inclusion of varying amounts of error variance is important
in this kind of study. Since the social scientist operates with data
that have a substantial level of error, it is of considerable
importance to see how varying levels of error may influence the
validity of the results of various procedures. In studies of this
nature, there are various ways to interpret the meaning of error
variance. In this study, the primary interpretation of error variance
is that it reflects the reliability of the measuring instrument being
simulated. The error levels of 0%, 10%, 25%, 35%, 50%, and 60% can be
interpreted as representing respectively approximate test
reliabilities of 1.00, 0.95, 0.87, 0.80, 0.71, and 0.63.

Three treatment levels, six error levels, and three pretest
difference levels were utilized; thus 54 original experiments were
simulated. Each time an experiment was simulated, four new
populations, each of size 6,500 and each of which conformed to the
above specifications, were generated. From each of these four
populations, a sample of 20 observations was randomly selected to
simulate a two-group, pre- and post-test experiment.

The data from each "experiment" were then analyzed using six
models:

1) One-way analysis of variance of post-test scores

2) One-way analysis of variance of difference scores

3) One-way analysis of variance of residualized gain scores

4) Two-way analysis of variance

5) Repeated measures two-way analysis of variance

6) Analysis of covariance

The appropriate F ratio to evaluate the change was computed for each
model for each "experiment."

The simulation of the 54 experiments was replicated a total of 50
times. Therefore, an overall total of 2,700 experiments was
simulated. Across the 50 replications, the mean of the F ratios for
each model used in each "experiment" was computed. These mean F ratios
were used to evaluate the performance of the six models. The observed
mean F ratios were statistically compared with the expected F-ratio
values. Since the same "data" were analyzed with all six models, all
models had a common basis of evaluation.
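
The size of the design follows from crossing the three factors. The
short sketch below enumerates the conditions; the level labels and
variable names are hypothetical, and only the counts are taken from
the text.

```python
from itertools import product

# Labels are illustrative; "none" marks the level with no difference.
treatment_levels = ["none", "F=4.098", "F=7.353"]
error_levels = [0.0, 0.10, 0.25, 0.35, 0.50, 0.60]
pretest_levels = ["none", "F=4.098", "F=7.353"]
replications = 50

conditions = list(product(treatment_levels, error_levels, pretest_levels))
total_experiments = len(conditions) * replications
print(len(conditions), total_experiments)
```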

Results

In the cases in which the observed F ratios were anticipated to
be approximately equal to the expected F ratios, such as the one-way
analysis of variance of post-test scores with no pre-treatment
differences, the observed mean F ratios tended to be slightly greater
than the expected value, probably because of the highly skewed nature
of the F distribution. Otherwise, the results indicate that the
random number generator used to create the populations worked
adequately. The fluctuations from the expected values are within the
range of sampling error. The mean F ratios generated by each analysis
procedure in each of the simulated experiments in which there were no
initial between-group differences are presented in Table 1. The mean
F ratios for the simulated experiments using the second and third
levels of initial between-group differences are presented in Tables 2
and 3, respectively.

One-way analysis of variance of post-test scores. The observed
mean F ratios for the simulated experiments in which there were no
initial between-group differences indicate that this procedure worked
as expected. Only 1 of the 18 observed mean F ratios was
significantly different from the expected value. However, when initial
pre-treatment differences are present, clearly invalid and misleading
F ratios are generated. All of the conditions with second and third
level pretest differences resulted in observed mean F ratios for the
one-way analysis of variance that were significantly different from
the expected values.

Two-way analysis of variance. The fact that this model violates
the assumption of independence amongst pretest/post-test observations
was demonstrated by the invalid estimates of the treatment effects.
The F ratios for the treatment dimension were relatively consistently
lower than expected. Furthermore, the pretest difference effects
were consistently underestimated by an even larger margin. However,
the value of the F ratios generated seemed to be relatively unaffected
by the various levels of error variance. The treatment-pre/post
interaction effects proved to be clearly invalid estimates of the
change effects. For these effects, the F ratios were all
significantly different from the expected values.

Two-way analysis of variance using repeated measures. This
procedure does not cause the analyst to make the unwarranted
assumption of complete independence amongst all the scores. Instead,
there is assumed to be a dependence among the pretest and post-test
scores, which is represented by the inclusion of a subject dimension
in the analysis. The observed F ratios were relatively unaffected by
the treatment levels. For example, with no pretest differences, at
the 50% and 60% error levels, only one of the six observed F-ratio
means was significantly different from the expected value. However,
the amount of error variance greatly affected the validity of the
observed F ratios. With no pretest difference, the simulated
experiments with 0% to 35% error produced 9 out of 12 observed mean F
ratios which were significantly different from the expected value.
Even at the 35% error level where the expected F ratio was 7.353, the
observed mean F ratio was 11.171. At the 50% error variance level, the
observed mean F ratio finally dropped to 9.389 and at the 60% error
variance level, the mean F ratio was 7.487. The estimates of change
effects were relatively undisturbed by initial between-group
differences. Even at the third level of pretest differences, there was
the same number of observed mean F ratios which differed significantly
from the expected value. However, regardless of the amount of initial
between-group differences, this model was adversely affected by error
levels less than 50%.

Analysis of covariance. This procedure was relatively unaffected
by differences in the treatment levels. However, the F ratios
generated by this procedure appeared to have been directly affected by
the amount of error variance in the data. Only at the 60% level of
error variance were the observed mean F ratios not significantly
different from the expected values. At the 35% error variance level
when the expected F ratio was 7.353, the observed mean F ratio was
13.400. At the 50% error variance level, the observed mean F ratio
was 11.713, while again the expected value was 7.353. At the 60% error
variance level for the third treatment level, the observed mean F
ratio was 8.620. When initial between-group differences were
introduced, the observed mean F ratios deviated even further from the
expected values. At the third level of pretest difference, the
observed mean F ratio for the 35% error level was 18.568 while the
expected F ratio was 7.353. At the 50% error level for the third
treatment level, the observed mean F ratio was 17.860; at the 60%
error variance level, the observed mean F ratio was 17.667.

One-way analysis of variance of difference scores. This procedure
produced exactly the same values for the F ratios as the two-way
analysis of variance using repeated measures. Even though the mean
squares produced in the two procedures had different values, the
final F ratios were exactly identical to the tenth decimal place. An
example of a comparison between the two sets of results is presented
in Table 4. As with the two-way analysis of variance using repeated
measures, the one-way analysis of variance of difference scores
proved to be unaffected by the initial between-group differences.
However, the amount of error variance greatly affected the validity of
the F ratios generated.

One-way analysis of variance of residualized gain scores. The
results of this procedure were very similar to those of the analysis
of covariance, though generally the F ratios computed by this
procedure were slightly lower than those computed by the analysis of
covariance. The value of the observed mean F ratios was greatly
affected by the amount of error variance and the amount of initial
between-group difference. The error variance levels of 0% to 35%
resulted in 9 out of 12 observed mean F ratios being significantly
different from the expected value in the no-pretest-difference
conditions. At the 50% and 60% error levels, 1 of the 6 observed mean
F ratios was significantly different from the expected value. The
introduction of between-group differences caused the observed F
ratios to have even higher values, such that even at the 60% error
variance level, when there were between-group differences at the third
level, the observed mean F ratios were on the order of three times
the expected F-ratio value. At the second and third levels of pretest
differences, 32 of 36 observed mean F ratios were significantly
different from the expected values.

Discussion

These results reflect two kinds of phenomena. The first is the
ability of various statistical procedures to produce valid estimates
of change effects without being disturbed by possible initial
between-group differences. The second is the ability of the various
procedures to validly compute change effects in the context of error
variance.

The examination of the effects of initial between-group
differences, or pretest effects, is the ultimate purpose of this
study, for it is these very effects that the models studied were
designed to accommodate and overcome. The very rationale for the
measurement and the evaluation of change caused by some treatment is
based upon the supposition that treatment effects can be best
evaluated within the context of some kind of universal baseline. It
has been argued that even the most robust of between-group
experimental designs ultimately contaminates the assessment of the
change that occurs. Of the six procedures, only the analysis of
variance of difference scores and the two-way analysis of variance
using repeated measures apparently are not affected by pretest
differences. These two models, which have proven to be essentially
the same, apparently are able to accurately assess the treatment
effects regardless of any possible biases as to initial differences
between groups. The analysis of covariance model is apparently
insufficient in that it results in highly inflated F ratios.
Similarly, the analysis of variance of residualized gain scores is
apparently insufficient in that it also results in highly inflated F
ratios. As expected, the one-way analysis of variance of post-test
scores does not prove to be a valid way to estimate treatment effects
when there are prior between-group differences. While the two-way
analysis of variance model proves to be unaffected by pretest
differences, it apparently produces low estimates and, therefore,
invalid estimates of the treatment effects.

Probably the most important insight provided by this study is the
effect of the amount of error variance. These six models evaluate
treatment effects by using a general statistical structure based upon
two independent estimates of non-treatment variance (error variance)
such that the two estimates differ from each other as a function of
expected sampling distributions. These two independent estimates are
used to form a ratio. When the observed ratio is greater than the
expected ratio, the investigator can interpret the statistic as
indicating the presence of a treatment effect. In a situation where
there is no error of measurement, the only sources of variance are
within-group variance (individual-differences variance) and treatment
variance. The one-way analysis of variance of post-test scores is well
suited to this situation. However, the elimination of pretest effects
results in the removal of individual differences within each group and
leaves only the between-group differences. When data are analyzed by
extracting pretest values on an individual-by-individual basis, and
if there is no error of measurement, then the resultant score for an
individual in a given group is the treatment effect itself. In the
absence of error, each individual experiences the same treatment
effect. Therefore, each individual within a group has the same score.
Thus, within-group variance is eliminated. When there is no
within-group variance, the necessary second independent estimate of
the error is unavailable. Therefore, there is no basis for a
statistical evaluation.
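
The argument above can be illustrated with a small numerical sketch.
It is not the simulation used in this study: the group size, the score
scale, and the helper name one_way_f are assumptions chosen for
illustration. With error-free scores and a constant treatment effect,
every difference score within a group is identical, so the
within-group mean square is zero and the F ratio cannot be formed.

```python
import numpy as np

def one_way_f(groups):
    """Return (F, MS between, MS within) for a list of 1-D score arrays."""
    scores = np.concatenate(groups)
    grand_mean = scores.mean()
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (scores.size - len(groups))
    # With no within-group variance there is no error term, so F is undefined.
    f_ratio = ms_between / ms_within if ms_within > 0 else float("nan")
    return f_ratio, ms_between, ms_within

rng = np.random.default_rng(0)
# Integer-valued true scores keep the arithmetic exact (illustrative scale).
true_scores = rng.integers(30, 70, size=(2, 20)).astype(float)
effects = np.array([[0.0], [5.0]])   # constant treatment effect per group
pre = true_scores                    # error-free pretest
post = true_scores + effects         # error-free post-test

f_post, _, msw_post = one_way_f([post[0], post[1]])
diff = post - pre
f_diff, msb_diff, msw_diff = one_way_f([diff[0], diff[1]])

print(msw_post > 0)        # individual differences remain in post-test scores
print(msw_diff)            # 0.0: difference scores are constant within groups
print(np.isnan(f_diff))    # True: no F ratio can be formed
```

The post-test analysis still has a usable error term because individual
differences remain in the scores; the difference-score analysis does
not.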

In a condition in which there is no error, models such as
analysis of covariance, one-way analysis of variance of residualized
gain scores, one-way analysis of variance of difference scores, and
two-way analysis of variance using repeated measures, which "control"
for pretest differences and thus eliminate within-group variance,
were totally unable to validly evaluate the effects of the treatments.
This weakness in these models was apparent not only when the error of
measurement was 10% and 25%, but also when the error was as high as
35%. The analysis of covariance procedure continued to give invalid
estimates of the change effects at the 50% level of error. It should
be kept in mind that a 35% error of measurement translates into a
0.806 measurement reliability.
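
The text does not state how the 0.806 figure was obtained. One reading
that reproduces it, offered here only as an assumption, treats 35% as
the proportion of total score variance due to error, so that the
proportion of true-score variance is 0.65, and takes the square root
(the correlation between observed and true scores):

```latex
\sqrt{1 - 0.35} = \sqrt{0.65} \approx 0.806
```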

The concern for error variance can be viewed within a wider
context. The various populations, which were generated and then
sampled, were defined in terms of constant treatment differences and
constant pretest differences. These differences were constant in
that every observation within a population differed by the same value
from the observations in the other populations. The populations were
further defined in terms of proportion of error (though the error had
a mean of zero), which was randomly assigned and added to each member
of the populations. Thus, the populations had within-population or
within-group variance that was interpreted as measurement error.
Within this context, there was no treatment error in that every member
of the population was affected by the treatment to the same degree.
Furthermore, there were no other sources of error variance.
Consideration was given also to exploring the impact of treatment
error. This would reflect the more realistic situation in which a
treatment affects subjects in slightly different degrees, more
commonly called treatment by subject interaction. However, the
inclusion of a second source of error variance would have resulted
simply in a higher level of error variance, something already
evaluated by the dimension of level of error of measurement.
Theoretically, there would be no interactive effects between multiple
sources of error since error variance is independent and has a mean
of 0. A possible realistic exception is the situation in which the
treatments have different levels of variance. In general, the
dimension of measurement error can be interpreted within the broader
context of general experimental error normally associated with
educational and psychological experiments.
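
A population of the kind described above might be generated as in the
following sketch. The normal distributions, the unit true-score
variance, the function name make_population, and the reading of
"proportion of error" as error variance divided by total variance are
all assumptions; the text does not specify them.

```python
import numpy as np

def make_population(size, pretest_shift, treatment_shift, error_prop, rng):
    """Return (pre, post) scores for one hypothetical population.

    error_prop is read here as the share of total score variance that is
    random error (an assumption); it must be below 1.
    """
    true_sd = 1.0                                   # assumed true-score SD
    true = rng.normal(0.0, true_sd, size)           # individual differences
    # Choose the error SD so that error_var / (true_var + error_var) = error_prop.
    error_sd = true_sd * np.sqrt(error_prop / (1.0 - error_prop))
    # Constant shifts: every member differs from other populations by the same value.
    pre = true + pretest_shift + rng.normal(0.0, error_sd, size)
    post = true + pretest_shift + treatment_shift + rng.normal(0.0, error_sd, size)
    return pre, post

rng = np.random.default_rng(1)
pre, post = make_population(200_000, pretest_shift=0.5, treatment_shift=1.0,
                            error_prop=0.35, rng=rng)

# True scores cancel in post - pre, leaving twice the error variance.
error_share = (np.var(post - pre) / 2.0) / np.var(pre)
mean_change = (post - pre).mean()
print(round(error_share, 2), round(mean_change, 2))
```

Because the treatment shift is a constant, all within-population
variance in the change scores comes from the zero-mean error term,
which matches the description of the error as the only source of
within-group variance in change.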

Summary

The results of this Monte Carlo study substantiate earlier
concerns regarding the evaluation of change. The results revealed
specific problems with each of the six statistical models. While the
behavioral and educational researcher may be able to measure various
change phenomena, there is now serious question as to whether or not
he or she is able to statistically evaluate the change. Analysis of
covariance and analysis of variance of residualized gain scores
appear to be entirely inappropriate. Multiple factor analysis of
variance models utilizing pretest and post-test scores appear to
yield invalid F ratios. The analysis of variance of difference scores
and the multiple factor analysis of variance using repeated measures
are the only models which can adequately control for pre-treatment
differences; however, they appear to be robust only when the error
level is 50% or more. This casts serious doubt on published findings
and theories based upon change score analysis. When an investigator
is collecting data which have an error level less than 50% (which is
true in most situations), a change score analysis is entirely
inadvisable until an alternative analysis model is developed.

Table 1

Expected and Observed Mean F-ratio Values
With No Pre-Test Difference

Treatment  Expected  One-way      Two-way      Two-way      Analysis     ANOVA of     ANOVA of
Level                ANOVA        ANOVA        Repeated     of           Difference   Residualized
                                               Measures     Covariance   Scores       Gain Scores

0% Error
1          1.056     0.992        0**          0**          0**          0**          0**
2          4.098     4.297        1.636**      [illegible]  [illegible]  [illegible]  [illegible]
3          7.353     [illegible]  3.362**      [illegible]  [illegible]  [illegible]  [illegible]

10% Error
1          1.056     1.291        0.085**      0.821        0.879        0.821        0.868
2          4.098     3.654        1.889**      16.954**     17.171**     16.954**     16.867**
3          7.353     8.495        3.352**      32.703**     [illegible]  [illegible]  [illegible]

25% Error
1          1.056     0.980        0.263**      1.020        1.073        1.020        1.078
2          4.098     4.249        1.661**      7.664**      8.662**      7.664**      8.653**
3          7.353     [illegible]  2.737**      15.278**     [illegible]  [illegible]  [illegible]

35% Error
1          1.056     0.829        0.354**      1.025        0.825        1.025        0.815
2          4.098     4.817        1.882**      5.229**      6.355**      5.229**      6.294**
3          7.353     8.628        4.052**      11.171**     13.400**     11.171**     13.279**

50% Error
1          1.056     1.082        0.600**      1.146        1.164        1.146        1.153
2          4.098     5.243        2.371**      4.814        6.229**      4.814        6.293
3          7.353     9.048*       4.683**      9.389*       [illegible]  9.389*       11.611**

60% Error
1          1.056     0.892        0.737**      1.229        0.977        1.229        0.971
2          4.098     4.733        2.212**      3.911        5.008        3.911        4.946
3          7.353     6.644        4.196**      7.487        8.620        7.487        [illegible]

*  p(t) < .05
** p(t) < .01

where t = [(observed mean F-ratio) - (expected F-ratio)] /
          [(standard deviation of observed F-ratios) / sqrt([illegible])]
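
The t statistic defined in the table note is an ordinary one-sample t
test of the mean observed F ratio against its expected value. A
minimal sketch follows; the function names and the example F ratios
are hypothetical, and the use of the sample standard deviation (n - 1
divisor) is an assumption. As a side observation, the mean of a
central F distribution with 38 denominator degrees of freedom (the
within-groups degrees of freedom shown in Table 4) is 38/36, or about
1.056, which coincides with the expected value tabled for treatment
level 1; the text does not confirm that the expected values were
derived this way.

```python
import math

def expected_central_f(df_within):
    """Mean of a central F distribution; defined only for df_within > 2."""
    return df_within / (df_within - 2)

def t_against_expected(f_ratios, expected_f):
    """One-sample t comparing the mean observed F ratio with expected_f."""
    n = len(f_ratios)
    mean_f = sum(f_ratios) / n
    # Sample variance with an n - 1 divisor (an assumption, see lead-in).
    var_f = sum((f - mean_f) ** 2 for f in f_ratios) / (n - 1)
    standard_error = math.sqrt(var_f) / math.sqrt(n)
    return (mean_f - expected_f) / standard_error

print(round(expected_central_f(38), 3))   # 1.056

# Hypothetical F ratios from five replications, for illustration only.
observed = [0.8, 1.2, 1.0, 1.4, 0.6]
t_value = t_against_expected(observed, expected_central_f(38))
print(t_value < 0)   # True: the mean of these values (1.0) is below 1.056
```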
Table 2

Expected and Observed Mean F-ratio Values With
Second Level Pre-Test Difference

Treatment  Expected  One-way      Two-way      Two-way      Analysis     ANOVA of     ANOVA of
Level      Value     ANOVA        ANOVA        Repeated     of           Difference   Residualized
                                               Measures     Covariance   Scores       Gain Scores

0% Error
1          0.464     [illegible]  0**          0**          0**          [illegible]  0**
2          4.098     16.618**     1.649**      [illegible]  [illegible]  [illegible]  1,133.626*
3          7.353     19.452**     3.222**      [illegible]  [illegible]  [illegible]  1,433.624*

10% Error
1          0.464     5.735**      0.140**      1.472        1.787        1.472        1.591
2          4.098     15.502**     1.639**      16.199**     18.698**     16.199**     16.423**
3          7.353     20.424**     [illegible]  31.998**     33.593**     31.998**     28.747**

25% Error
1          0.464     4.193**      0.232**      0.368        1.099        0.868        0.978
2          4.098     13.396**     1.835**      7.612**      11.288**     7.612**      10.220**
3          7.353     20.021**     3.965**      16.508**     22.409**     16.508**     20.169**

35% Error
1          0.464     4.001*       0.361**      1.119        1.753        1.119        1.636
2          4.098     16.923*      2.175**      6.098*       12.070**     6.098*       10.414**
3          7.353     19.201*      3.404**      9.459*       16.560**     9.459*       15.082**

50% Error
1          0.464     4.405**      0.364        1.935*       2.824**      1.935*       2.645**
2          4.098     14.421**     1.867**      3.594        9.106**      3.994        7.938**
3          7.353     17.381**     4.145**      8.737        15.259**     8.737        14.202**

60% Error
1          0.464     4.445**      0.626**      1.062        2.490**      1.062        2.326**
2          4.098     14.924**     2.354**      3.848        10.529**     3.848        9.460**
3          7.353     17.911**     3.571**      5.950        14.104**     5.950        12.888*

*  p(t) < .05
** p(t) < .01

where t = [(observed mean F-ratio) - (expected F-ratio)] /
          [(standard deviation of observed F-ratios) / sqrt([illegible])]

Table 3

Expected and Observed Mean F-ratio Values With
Third Level Pre-Test Difference

Treatment  Expected  One-way      Two-way      Two-way      Analysis     ANOVA of     ANOVA of
Level      Value     ANOVA        ANOVA        Repeated     of           Difference   Residualized
                                               Measures     Covariance   Scores       Gain Scores

0% Error
1          1.056     9.027**      0**          0**          [illegible]  0**          0**
2          4.098     19.371**     1.697**      [illegible]  [illegible]  [illegible]  664.300**
3          7.353     29.444**     3.518**      [illegible]  [illegible]  [illegible]  371.070**

10% Error
1          1.056     7.171**      0.109**      1.123        1.489        1.123        1.304
2          4.098     21.491**     1.963**      18.452**     20.833**     18.452**     16.588**
3          7.353     30.270**     3.489**      34.333**     34.995**     34.333**     25.979**

25% Error
1          1.056     8.809**      0.357**      [illegible]  2.687**      [illegible]  2.322**
2          4.098     21.739**     1.770**      6.959**      11.553**     6.959**      9.042**
3          7.353     26.600**     3.999**      16.276**     23.144**     16.276**     18.854**

35% Error
1          1.056     8.839**      0.528**      1.539        2.782**      1.539        2.339**
2          4.098     18.473**     1.596**      4.944        10.231**     4.944        8.312**
3          7.353     27.416**     3.852**      10.723**     18.568**     [illegible]  [illegible]

50% Error
1          1.056     7.944**      0.564**      1.129        3.531**      1.129        3.088**
2          4.098     20.788**     1.954**      4.202        11.835**     4.202        9.502**
3          7.353     25.636**     3.729**      7.961        17.860**     7.961        14.932**

60% Error
1          1.056     7.679**      0.461**      0.738*       3.311**      0.738*       2.794**
2          4.098     19.867**     2.238**      4.151        12.316**     4.151        9.863**
3          7.353     [illegible]  3.169**      5.558**      17.667**     5.558**      13.675**

*  p(t) < .05
** p(t) < .01

where t = [(observed mean F-ratio) - (expected F-ratio)] /
          [(standard deviation of observed F-ratios) / sqrt([illegible])]
Table 4

Summaries of One-way Analysis of Variance of Difference Scores and
Two-way Analysis of Variance With Repeated Measures

One-way Analysis of Variance of Difference Scores

Source                            Sum of Squares   d.f.   Mean Square   F
Between                               16.549          1      16.549     15.723
Within                                39.996         38       1.052
Total                                 56.545         39

Two-way Analysis of Variance With Repeated Measures

Source                            Sum of Squares   d.f.   Mean Square   F
Between
  Treatment                            0.488          1       0.488      0.327
  Treatment x Subject                 56.731         38       1.492
Among
  Pre-Post Test                        4.210          1       4.210      8.000
  Treatment x Pre-Post                 8.274          1       8.274     15.72[illegible]
  Treatment x Pre-Post x Subject      19.703         38       0.526
Total                                 89.702         79
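
The equivalence of the two summaries in Table 4 can be checked
numerically. The sketch below uses hypothetical simulated data, not
the data behind Table 4, with two groups of 20 subjects. It computes
the F ratio from a one-way analysis of difference scores and the
treatment x pre-post interaction F ratio from a repeated-measures
partition of the same scores. The function names are illustrative.

```python
import numpy as np

def f_difference_scores(pre, post):
    """One-way ANOVA F on post - pre; pre and post have shape (groups, n)."""
    d = post - pre
    k, n = d.shape
    group_means = d.mean(axis=1, keepdims=True)
    ss_between = n * ((group_means - d.mean()) ** 2).sum()
    ss_within = ((d - group_means) ** 2).sum()
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

def f_treatment_by_occasion(pre, post):
    """Treatment x pre-post interaction F for a groups x occasions design."""
    y = np.stack([pre, post], axis=2)                 # (groups, n, occasions)
    k, n, t = y.shape
    grand = y.mean()
    group_m = y.mean(axis=(1, 2), keepdims=True)      # group means
    occ_m = y.mean(axis=(0, 1), keepdims=True)        # occasion means
    cell_m = y.mean(axis=1, keepdims=True)            # group x occasion means
    subj_m = y.mean(axis=2, keepdims=True)            # subject means
    ss_inter = n * ((cell_m - group_m - occ_m + grand) ** 2).sum()
    # Error term: occasion x subjects-within-groups residual.
    ss_error = ((y - cell_m - subj_m + group_m) ** 2).sum()
    df_inter = (k - 1) * (t - 1)
    df_error = k * (n - 1) * (t - 1)
    return (ss_inter / df_inter) / (ss_error / df_error)

rng = np.random.default_rng(2)
pre = rng.normal(50.0, 10.0, size=(2, 20))
post = pre + np.array([[0.0], [3.0]]) + rng.normal(0.0, 5.0, size=(2, 20))

f_diff = f_difference_scores(pre, post)
f_inter = f_treatment_by_occasion(pre, post)
print(np.isclose(f_diff, f_inter))   # True
```

With two occasions, the interaction sum of squares is half the
between-groups sum of squares of the difference scores (8.274 against
16.549 in Table 4), and the error term is halved in the same way, so
the two F ratios agree.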